There is evidence that biological synapses have only a fixed number of discrete weight states.
Memory storage with such synapses behaves quite differently from synapses with unbounded,
continuous weights, as old memories are automatically overwritten by new memories. We calculate
the storage capacity of discrete, bounded synapses in terms of Shannon information. For optimal
learning rules, we investigate how information storage depends on the number of synapses, the
number of synaptic states and the coding sparseness.



Memory in biological neural systems is believed to be
stored in the synaptic weights. Various computational
models of such memory systems have been constructed
in order to study their properties and to explore potential
hardware implementations. Storage capacity and optimal
learning rules have been studied both for single-layer
associative networks, studied here, and for auto-associative
networks. Commonly, a synaptic weight in such models is
represented by an unbounded, continuous real number.
However, more realistically, synaptic weights have values
between some biophysical bounds. Furthermore, synapses
might be restricted to occupy a limited number of synaptic
states. Consistent with this, some experiments show that,
physiologically, synaptic weight changes occur in steps.
In contrast to networks with continuous, unbounded synapses,
in networks with discrete, bounded synapses old memories are
overwritten by new ones; in other words, the memory trace
decays.

It is common to use the signal-to-noise ratio (SNR) 
to quantify memory storage [3, E3]. When weights are 
unbounded, each stored pattern has the same SNR, and 
storage can simply be defined as the maximum number 
of patterns for which the SNR is larger than some fixed, 
minimum value. For discrete, bounded synapses perfor- 
mance must be characterized by two quantities: the ini- 
tial SNR, and its decay rate. Altering the learning rules 
typically results in either 1) a decrease in initial SNR but 
a slower decay of the SNR (i.e. an increase in memory life- 
time) [10], or 2) an increase in initial SNR but a decrease 
in memory lifetime. Optimization of the learning rule is
therefore ambiguous, because an arbitrary trade-off must be
made between these two effects.

The conflict between optimizing learning and optimiz- 
ing forgetting can be resolved by analyzing the capacity 
of synapses in terms of Shannon information. Here we 
describe a framework for calculating the information ca- 
pacity of bounded, discrete synapses, and use it to find 
optimal learning rules. We model a single neuron, and 
investigate how the information capacity depends on the 
number of synapses and the number of synaptic states, 
both for dense and sparse coding. 

We consider a single neuron which has n inputs. At
each time step it stores an n-dimensional binary pattern
with independent entries $x_a$, $a = 1 \ldots n$. The sparsity
p corresponds to the fraction of entries in x that cause
strengthening of the synapse. It is optimal to set the low
state equal to $-p$, and the high state to $q := 1 - p$, so
that the probability density for inputs is given by
$P(x) = q\,\delta(x + p) + p\,\delta(x - q)$ and $\langle x \rangle = 0$.
The case $p = 1/2$ we term dense; furthermore, we assume that
$p \le 1/2$, as the case $p > 1/2$ is fully analogous. Although
biological coding is believed to be sparse, we briefly note that
in biology the relation between p and coding sparseness is likely
very complicated.

Each synapse occupies one of W states. The corresponding
values of the weight are assumed to be equidistantly spaced
around zero and are written as a W-dimensional vector w,
i.e. for a 3-state synapse $w = \{-1, 0, 1\}$, while for a
4-state synapse $w = \{-2, -1, 1, 2\}$.
In numerical analysis we have sometimes seen an increase
in information by varying the values of the weight states;
however, this increase was always small. Note that w
is very different from the definition of a "weight vector"
commonly used in network models.

The learning paradigm we consider is the following: 
during the learning phase a pattern is presented each 
time step, and the synapses are updated in an unsuper- 
vised manner. The learning algorithm is on-line, i.e. the 
synapses can only be updated when the pattern is pre- 
sented. As bounded, discrete synapses store new memo- 
ries at the expense of overwriting old ones, we can assume 
that sufficient patterns are stored such that the earliest 
pattern has almost completely decayed and the distribu- 
tion of the synaptic weights has reached an equilibrium. 

After learning, the neuron is tested on learned and 
novel patterns. Presentation of a learned pattern will 
yield an output which is on average larger than that for 
a novel pattern. The presentation of a novel, random
pattern $\{x_a^u\}$ leads to a signal $h_u = \sum_a x_a^u w_a$, where
the weights are $w_a$, $a = 1, \ldots, n$. As this novel pattern
will be uncorrelated to the weight, it has mean
$\langle h_u \rangle = n \langle x \rangle \langle w \rangle = 0$, and variance

$$\langle \delta h_u^2 \rangle = n \left[ \langle x^2 w^2 \rangle - \langle x \rangle^2 \langle w \rangle^2 \right] = npq \langle w^2 \rangle, \qquad (1)$$

where $\langle w \rangle = w \cdot \pi^\infty$, $\langle w^2 \rangle = \sum_{i=1}^{W} w_i^2 \pi_i^\infty$, and $\pi^\infty$ is the
equilibrium distribution of weights.
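The variance formula above can be checked numerically. The sketch below compares the analytic value $npq\langle w^2\rangle$ with a Monte Carlo estimate. The three-state weight values and the distribution `pi_inf` are arbitrary illustrative choices, not values from the text, and the simulation redraws the weights on every trial, which is an assumption about how the average is taken.

```python
import random

def input_moments(p):
    """Return (<x>, <x^2>) for x = q with probability p and x = -p with probability q."""
    q = 1.0 - p
    mean = p * q + q * (-p)
    second = p * q * q + q * p * p
    return mean, second

def analytic_variance(n, p, weights, pi_inf):
    """Variance of the output for a novel pattern: n * p * q * <w^2>."""
    q = 1.0 - p
    w2 = sum(w * w * prob for w, prob in zip(weights, pi_inf))
    return n * p * q * w2

def simulated_variance(n, p, weights, pi_inf, trials=5000, seed=0):
    """Monte Carlo estimate of Var(h_u), with h_u = sum_a x_a * w_a."""
    rng = random.Random(seed)
    q = 1.0 - p
    outputs = []
    for _ in range(trials):
        xs = [q if rng.random() < p else -p for _ in range(n)]
        ws = rng.choices(weights, weights=pi_inf, k=n)  # weights redrawn each trial (assumption)
        outputs.append(sum(x * w for x, w in zip(xs, ws)))
    mean = sum(outputs) / trials
    return sum((h - mean) ** 2 for h in outputs) / trials

if __name__ == "__main__":
    weights, pi_inf = [-1, 0, 1], [0.3, 0.4, 0.3]  # illustrative values only
    print(analytic_variance(20, 0.2, weights, pi_inf))
    print(simulated_variance(20, 0.2, weights, pi_inf))
```

The two printed numbers should agree up to sampling noise; the first moment returned by `input_moments` vanishes, and the second equals pq.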
Because the synapses are independent, and the performance
is characterized statistically, we can use Markov
transition matrices to define the learning. If in
the learning phase an input is high (low), the synapse
is updated according to the matrix $M^+$ ($M^-$). Thus,
the distribution of potentiated weights immediately after
a high input is $\pi^+(t = 0) = M^+ \pi^\infty$. As subsequent,
uncorrelated patterns are learned, this signal
decays according to $\pi^+(t) = M^t \pi^+(t = 0)$, where
$M := pM^+ + qM^-$ is the expected update matrix at
each time-step. The equilibrium distribution $\pi^\infty$ is
identical to the eigenvector of M with eigenvalue one. The
mean signal for learned patterns is

$$\langle h_\ell \rangle(t) = npq \, w^T M^t (M^+ - M^-) \pi^\infty. \qquad (2)$$



This signal decays such that the synapses contain most
information on more recent patterns. The decay is typically
exponential, with a time constant set by the sub-dominant
eigenvalue $\lambda_2$ of M, $\tau = -1/\ln \lambda_2$.
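As an illustration of Eq. (2), the following sketch computes the equilibrium distribution and the mean signal for arbitrary column-stochastic matrices. The helper names, the use of power iteration, and the two-state example (potentiation with probability 0.5 on a high input, depression with probability 0.2 on a low input) are assumptions made for this example only.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def mean_matrix(m_plus, m_minus, p):
    """Expected update matrix M = p * M+ + q * M-."""
    q = 1.0 - p
    return [[p * a + q * b for a, b in zip(rp, rm)] for rp, rm in zip(m_plus, m_minus)]

def equilibrium(m, iters=2000):
    """Eigenvector of M with eigenvalue one, by power iteration.

    Assumes the chain mixes within `iters` steps; slowly mixing chains need more.
    """
    pi = [1.0 / len(m)] * len(m)
    for _ in range(iters):
        pi = mat_vec(m, pi)
    return pi

def mean_signal(n, p, weights, m_plus, m_minus, t):
    """Mean output for a pattern learned t steps ago: n p q w^T M^t (M+ - M-) pi_inf."""
    q = 1.0 - p
    m = mean_matrix(m_plus, m_minus, p)
    pi_inf = equilibrium(m)
    up, down = mat_vec(m_plus, pi_inf), mat_vec(m_minus, pi_inf)
    v = [a - b for a, b in zip(up, down)]
    for _ in range(t):
        v = mat_vec(m, v)
    return n * p * q * sum(w * x for w, x in zip(weights, v))

if __name__ == "__main__":
    # Hypothetical two-state synapse, states ordered (low, high), weights (-1, +1).
    m_plus = [[0.5, 0.0], [0.5, 1.0]]
    m_minus = [[1.0, 0.2], [0.0, 0.8]]
    for t in range(4):
        print(t, mean_signal(100, 0.3, [-1, 1], m_plus, m_minus, t))
```

For this two-state example the signal shrinks by a constant factor at every step, which is the exponential decay described above.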

When tested with an equal mix of learned and unlearned
patterns, the mutual information in the neuron's
output about whether a single pattern is learned or not
is

$$I = \sum_{s \in \{\ell, u\}} \sum_h P(s) P(h|s) \log_2 \frac{P(h|s)}{P(h)}, \qquad (3)$$

where $P_\ell$ ($P_u$) denotes the distribution of the output of
the neuron to learned (unlearned) patterns, i.e. $P(h|s)$ for
$s = \ell$ ($s = u$), with $P(s) = 1/2$ and $P(h) = \frac{1}{2}[P_\ell(h) + P_u(h)]$.
If the two output distributions are perfectly separated, the learned
pattern contributes one bit of information, whilst total
overlap implies zero information storage. As the patterns
are independent, the total information is the sum of the
information over all patterns presented during learning.

Unfortunately, the full distributions of h are complicated
multinomials. Furthermore, it would be challenging
for a biological readout to distinguish between two
arbitrary distributions. Instead we impose a threshold
between two possible responses, which could, say,
correspond to the neuron firing or not. If the number of
synapses is large, we can approximate the distribution
of h with a Gaussian and the information reduces to a
function of the SNR

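$$I(t) = 1 + r(t) \log_2 r(t) + [1 - r(t)] \log_2 [1 - r(t)], \qquad (4)$$

where $r(t) = \frac{1}{2}\,\mathrm{erfc}\left(\sqrt{\mathrm{SNR}(t)/8}\right)$, and the SNR is defined
as

$$\mathrm{SNR}(t) = \frac{2 \left[ \langle h_\ell \rangle(t) - \langle h_u \rangle \right]^2}{\langle \delta h_\ell^2 \rangle(t) + \langle \delta h_u^2 \rangle}. \qquad (5)$$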



In the numerical simulations we use Eqs. (4) and (5), but
for the analytical expressions we assume the same variance
of the output for learned and unlearned patterns,
$\langle \delta h_\ell^2 \rangle(t) \approx \langle \delta h_u^2 \rangle$. Importantly, the information (4)
is a saturating function of the SNR, and for very high
SNR, the information is approximately one bit. Meanwhile,
for small SNR, the information is linear in the SNR,
$I \approx \mathrm{SNR}/(4\pi \ln 2)$.
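The mapping from SNR to information can be evaluated directly. The sketch below implements Eq. (4) with the standard library, and uses bisection to locate the SNR that gives half a bit; the function names and the bisection bounds are choices made for this example.

```python
import math

def information(snr):
    """Information in bits about one pattern, for a thresholded Gaussian readout (Eq. 4)."""
    r = 0.5 * math.erfc(math.sqrt(snr / 8.0))
    if r <= 0.0:  # erfc underflows for very large SNR; the limit is one bit
        return 1.0
    return 1.0 + r * math.log2(r) + (1.0 - r) * math.log2(1.0 - r)

def snr_for_information(target_bits, lo=0.0, hi=1000.0, iters=200):
    """Bisection for the SNR at which information(snr) equals target_bits."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if information(mid) < target_bits:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    small = 1e-3
    print(information(small), small / (4 * math.pi * math.log(2)))  # linear regime
    print(snr_for_information(0.5))  # SNR that yields half a bit
```

The first line of output illustrates the linear small-SNR regime, and the second gives the half-bit threshold used in the high-SNR approximation.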

The total information per synapse is obtained by
summing together the information of all patterns and
dividing by the number of synapses, thus
$I_S = \frac{1}{n} \sum_t I[\mathrm{SNR}(t)]$. In cases in which the initial SNR is
very low

$$I_S \approx \frac{1}{4\pi n \ln 2} \sum_t \mathrm{SNR}(t). \qquad (6)$$
In the opposite limit, when the initial SNR is very high,
recent patterns contribute one bit. We approximate as if
all patterns with more than 1/2 bit actually contribute
one bit, whilst all patterns with less information
contribute nothing. In this limit the information thus equals
the number of patterns with more than 1/2 bit of
information

$$I_S = \frac{t_c}{n}, \qquad (7)$$

where $t_c$ is implicitly defined by $I(t_c) = 1/2$.
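To see how the two approximations compare with the full sum, the sketch below assumes a purely exponential SNR decay, SNR(t) = snr0 * exp(-t / tau), which is a simplification introduced for this example. It evaluates the exact sum alongside the low-SNR and high-SNR limits; the truncation length and the parameter values in the usage lines are arbitrary.

```python
import math

S_HALF_BIT = 6.02  # SNR corresponding to 1/2 bit, as quoted in the text

def information(snr):
    """Information in bits as a function of the SNR (Eq. 4)."""
    r = 0.5 * math.erfc(math.sqrt(snr / 8.0))
    if r <= 0.0:
        return 1.0
    return 1.0 + r * math.log2(r) + (1.0 - r) * math.log2(1.0 - r)

def exact_info_per_synapse(snr0, tau, n):
    """(1/n) * sum_t I[SNR(t)], truncated once the SNR has decayed by exp(-60)."""
    t_max = int(60 * tau) + 1
    total = sum(information(snr0 * math.exp(-t / tau)) for t in range(t_max))
    return total / n

def low_snr_approx(snr0, tau, n):
    """Low-SNR limit: geometric sum of the SNR divided by 4 pi n ln 2."""
    snr_sum = snr0 / (1.0 - math.exp(-1.0 / tau))
    return snr_sum / (4.0 * math.pi * n * math.log(2))

def high_snr_approx(snr0, tau, n, s=S_HALF_BIT):
    """High-SNR limit: t_c / n, with SNR(t_c) = s."""
    return tau * math.log(snr0 / s) / n

if __name__ == "__main__":
    print(exact_info_per_synapse(0.01, 5.0, 100), low_snr_approx(0.01, 5.0, 100))
    print(exact_info_per_synapse(1e4, 50.0, 1000), high_snr_approx(1e4, 50.0, 1000))
```

In both regimes the approximation tracks the exact sum closely, with the high-SNR counting rule slightly underestimating it for these parameters.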

The storage capacity depends on the $W \times W$ learning
matrices $M^+$ and $M^-$. To find the maximal storage
capacity we need to optimize these matrices, and
this optimization will in general depend on the sparseness,
the number of synapses, and the number of states
per synapse. Because these are Markov transition matrices,
their columns need to sum to one, leaving $W(W - 1)$
free variables per matrix. For dense patterns ($p = 1/2$)
one can impose the additional symmetry
$(M^+)_{ij} = (M^-)_{W-i,\,W-j}$.

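In the case of binary synapses we write

$$M^+ = \begin{pmatrix} 1 - f_+ & 0 \\ f_+ & 1 \end{pmatrix}, \qquad M^- = \begin{pmatrix} 1 & f_- \\ 0 & 1 - f_- \end{pmatrix}, \qquad (8)$$

where $f_+$ ($f_-$) is the probability of potentiation (depression) given a high (low) input.
We first consider the limit of few synapses, for which the
initial SNR is low, and use (6) to compute the information.
(We keep $np > 1$ and $n > 10$ to ensure that there
are sufficient distinct patterns to learn.) We find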



$$I_S = \frac{pq}{\pi \ln 2} \, \frac{f_+^2 f_-^2}{(p f_+ + q f_-)^3 \, (2 - p f_+ - q f_-)}. \qquad (9)$$



The values for $f_+$ and $f_-$ that yield maximal information
depend on the value of the density p. For $0.11 < p < 0.89$,
one has $f_+ = f_- = 1$, which gives equilibrium weight
distribution $\pi^\infty = (q, p)^T$, and

$$I_S = \frac{pq}{\pi \ln 2}. \qquad (10)$$



In this case the synapse is modified at every time-step 
and only retains the most recently presented pattern; the 
information stored on one pattern drops to zero as soon 
as the next pattern is learned. 
For sparser patterns, $p < 0.11$, a second solution to
Eq. (9) is optimal, for which $f_+ = 1$, $f_- \approx 2p$. That is,
potentiation occurs for every high input, but given a low input,
depression only occurs stochastically, with probability 2p.
As a result, forgetting is not instantaneous and the SNR
decays exponentially with time constant $\tau = 1/(6p)$. The
associated weight distribution is $\pi^\infty \approx (2/3, 1/3)^T$, which
is interesting to compare to experiments in which about
80% of the synapses were found to be in the low state.
The information per synapse is



$$I_S = \frac{1}{\pi \ln 2} \left( \frac{2}{27} + O(p) \right). \qquad (11)$$



There are two important observations to be made from
Eqs. (10) and (11): 1) the information remains finite at low p,
2) as long as the approximation is valid, each additional
synapse contributes to the information.
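The two optima described above can be reproduced with a brute-force search over the expression in Eq. (9). The sketch below is only illustrative: the grid resolution and function names are choices made here, and the formula is written in the form $I_S = \frac{pq}{\pi \ln 2} f_+^2 f_-^2 / [(pf_+ + qf_-)^3 (2 - pf_+ - qf_-)]$, which follows from summing the SNR of a two-state synapse.

```python
import math

def info_low_snr(p, f_plus, f_minus):
    """Information per synapse of a binary synapse in the low-SNR limit (Eq. 9)."""
    q = 1.0 - p
    z = p * f_plus + q * f_minus
    prefactor = p * q / (math.pi * math.log(2))
    return prefactor * (f_plus * f_minus) ** 2 / (z ** 3 * (2.0 - z))

def best_binary_rule(p, steps=100):
    """Grid search over f_plus, f_minus in (0, 1]; returns (info, f_plus, f_minus)."""
    best = (-1.0, None, None)
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            f_plus, f_minus = i / steps, j / steps
            value = info_low_snr(p, f_plus, f_minus)
            if value > best[0]:
                best = (value, f_plus, f_minus)
    return best

if __name__ == "__main__":
    print(best_binary_rule(0.5))   # dense coding
    print(best_binary_rule(0.05))  # sparse coding
```

For dense coding the search settles on deterministic updates, while for sparse coding depression becomes stochastic with a probability close to 2p.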

We next consider the limit of many synapses, for which
the initial SNR is high. With Eq. (7) we find

$$I_S = \frac{1}{2n \ln[1 - f_+ p - f_- q]} \, \ln\left[ \frac{s \, (f_+ p + f_- q)^2}{4npq \, f_+^2 f_-^2} \right], \qquad (12)$$



where $s \approx 6.02$ is defined as the value of the SNR which
corresponds to 1/2 bit of information. The optimal learning
parameters are in this limit $f_+ = e\sqrt{sq/(pn)}$ and
$f_- = e\sqrt{sp/(qn)}$, leading to an equilibrium weight distribution
$\pi^\infty = (1/2, 1/2)^T$. In this regime the learning is stochastic,
with the probability for potentiation/depression
decreasing as the number of synapses increases. The
corresponding information is

$$I_S = \frac{1}{2e\sqrt{spqn}} \approx \frac{0.075}{\sqrt{pqn}}. \qquad (13)$$



Hence, as n becomes large, adding extra synapses no
longer leads to substantial improvement in the information
storage capacity. The memory decay time constant
is $\tau = \sqrt{n}/(4e\sqrt{spq})$.
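The large-n result can be checked by evaluating the high-SNR expression at the quoted optimum. In the sketch below, the formula for $I_S$ is the one obtained from $t_c/n$ with an SNR that decays as $(1 - f_+p - f_-q)^{2t}$; the function names and the choice $n = 10^6$ are assumptions made for this example.

```python
import math

S = 6.02  # SNR corresponding to 1/2 bit

def info_high_snr(n, p, f_plus, f_minus, s=S):
    """Information per synapse of a binary synapse in the high-SNR limit (Eq. 12)."""
    q = 1.0 - p
    z = f_plus * p + f_minus * q
    arg = s * z * z / (4.0 * n * p * q * (f_plus * f_minus) ** 2)
    return math.log(arg) / (2.0 * n * math.log(1.0 - z))

def optimal_rates(n, p, s=S):
    """Optimal potentiation and depression probabilities quoted in the text."""
    q = 1.0 - p
    return math.e * math.sqrt(s * q / (p * n)), math.e * math.sqrt(s * p / (q * n))

def info_optimal(n, p, s=S):
    """Closed-form optimum, 1 / (2 e sqrt(s p q n)) (Eq. 13)."""
    q = 1.0 - p
    return 1.0 / (2.0 * math.e * math.sqrt(s * p * q * n))

if __name__ == "__main__":
    n, p = 10 ** 6, 0.05
    f_plus, f_minus = optimal_rates(n, p)
    print(info_high_snr(n, p, f_plus, f_minus), info_optimal(n, p))
    # Scaling both rates away from the optimum lowers the information.
    print(info_high_snr(n, p, 1.2 * f_plus, 1.2 * f_minus))
    print(info_high_snr(n, p, 0.8 * f_plus, 0.8 * f_minus))
```

The small residual difference between the two expressions comes from approximating $\ln(1 - z)$ by $-z$ in the closed form.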

To verify the above results, we carried out a numeri- 
cal optimization of learning matrices. We find there is 
a smooth interpolation between the two limiting cases, 
and for a given sparseness there is a critical number of 
synapses beyond which the addition of further synaptic 
inputs does not substantially improve information stor- 
age capacity. This occurs when the initial SNR becomes 
of order 1. For dense patterns, this occurs for just a
few synapses, whilst for sparse patterns this number is
proportional to $1/p$; see Fig. 1.

It is interesting to compare the storage capacity found
here with that of a Willshaw net, which also involves
binary synapses. In the Willshaw model, prior to learning,
all synapses occupy the low state, whilst the learning
process consists solely of potentiation of certain synapses. 
This means that as more patterns are presented, more 
synapses move to the up state, and eventually all memo- 
ries are lost. However, when only a finite, optimal num- 
ber of patterns are presented, such synapses perform well. 
Figure 1: Information capacity of binary synapses. Information
storage capacity per synapse versus the number of synaptic
inputs, for dense (p=0.5), sparse (p=0.05) and very sparse
(p=0.005) coding. Lines show analytic results, whilst points
show numerical results. For a small number of synapses, each
additional synapse contributes to the information. However,
for many synapses, the information per synapse decreases as
$1/\sqrt{n}$.



When the task is to successfully reproduce an output pattern
associated with each input pattern, the maximum
capacity is 0.69 bits/synapse. When measured
for a binary response, as in the framework of this paper,
this value is reduced. In the limit of few synapses, and
sparse patterns, the storage capacity is approximately
0.11 bits/synapse, which is several times higher than
the storage we obtain here. However, as the number of
synapses increases, the storage capacity becomes proportional
to $n^{-1}$, a faster decay than the $n^{-1/2}$ we find for
our case.



Next, we examine whether storage capacity increases
as the number of synaptic states increases. Even under
small or large n approximations, the information is in
general a complicated function of the learning parameters,
due to the complexity of the invariant eigenvector
$\pi^\infty$ of the general Markov matrix M. Thus, the optimal
learning must be found numerically by explicitly varying
all matrix elements. For large n we find that the optimal
transfer matrix is band diagonal, with the only transitions
allowed being one-step potentiation or depression.
Moreover, we find that for a fixed number of synaptic
states, the (optimized) information storage capacity
behaves similarly to that for binary synapses. In the dense
(p = 1/2) case, in the limit of many synapses, the optimal
learning rule takes the simple form

$$M^{p=0.5} = \frac{1}{2} \begin{pmatrix}
2-f & 1 & & & & \\
f & 0 & 1 & & & \\
 & 1 & 0 & \ddots & & \\
 & & \ddots & \ddots & 1 & \\
 & & & 1 & 0 & f \\
 & & & & 1 & 2-f
\end{pmatrix}, \qquad (14)$$



with $f = e\sqrt{s/n}$. The equilibrium weight distribution
is, somewhat surprisingly, peaked at both ends, and is
low and flat in the middle, $\pi^\infty \propto (1, f, f, \ldots, f, 1)^T$. The
information is

$$I_S = \frac{W-1}{2fn} \ln \frac{f^2 n}{s} = \frac{W-1}{e\sqrt{sn}}, \qquad (15)$$

and the corresponding time constant for the SNR is given
by $\tau = (W - 1)\sqrt{n/s}/(2e)$. Validity of these results
requires $fW$ to be small, to enable series expansion in f.
Hence, we find that information grows linearly with the
number of synaptic states, provided $W/\sqrt{n} \ll 1/(e\sqrt{s})$.
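The stated equilibrium distribution can be verified for a concrete case. The sketch below builds the dense-case mean update matrix under the reading that interior states always step up on a high input and down on a low input, while the lowest (highest) state is potentiated (depressed) only with probability f. That reading, the function names, and the values W = 5, f = 0.1 are assumptions made for this example.

```python
def dense_mean_matrix(W, f):
    """Column-stochastic mean update matrix M = (M+ + M-) / 2 for p = 1/2, W >= 2."""
    m = [[0.0] * W for _ in range(W)]
    for j in range(W):
        # Probability of stepping up on a high input, or down on a low input.
        up = 0.0 if j == W - 1 else (f if j == 0 else 1.0)
        down = 0.0 if j == 0 else (f if j == W - 1 else 1.0)
        if j < W - 1:
            m[j + 1][j] += 0.5 * up
        if j > 0:
            m[j - 1][j] += 0.5 * down
        m[j][j] += 0.5 * (1.0 - up) + 0.5 * (1.0 - down)
    return m

def claimed_equilibrium(W, f):
    """Normalized distribution proportional to (1, f, f, ..., f, 1)."""
    raw = [1.0] + [f] * (W - 2) + [1.0]
    total = sum(raw)
    return [x / total for x in raw]

def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

if __name__ == "__main__":
    W, f = 5, 0.1
    m = dense_mean_matrix(W, f)
    pi = claimed_equilibrium(W, f)
    print(pi)
    print(mat_vec(m, pi))  # unchanged by one update step if pi is stationary
```

Checking stationarity directly avoids iterating the chain, which mixes slowly when f is small.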

There appears to be no simple optimal transfer matrix
in the sparse case, even in the large n limit. However,
a formula for the storage capacity which fits well with
numerical results and is consistent with equations (13)
and (15) is

$$I_S = \frac{W-1}{2e\sqrt{spqn}}. \qquad (16)$$

Assuming that this formula, as for the binary synapse, is
the leading term in a series expansion in the two parameters
$f_+ = e\sqrt{sq/(pn)}$ and $f_- = e\sqrt{sp/(qn)}$, and that we
need $W f_+$ and $W f_-$ small for it to be accurate, then its
validity condition is $W\sqrt{q/(np)} \ll 1/(e\sqrt{s})$.

Numerical results agree with the equations above, and
are illustrated together with the analytic results in Fig. 2.
Thus, for fixed number of synapses, storage initially 
grows linearly with W. However, as W becomes larger, 
capacity saturates and becomes independent of W. This 
behavior is consistent with that of a number of dif- 
ferent (sub-optimal) learning rules studied in Ref. [lB |. 
These learning rules had the property that the product 
of the initial SNR and the time-constant r of SNR de- 
cay is independent of W (see Table 1 in [IB] for this 
remarkable identity, noting that the SNR there equals 
its square root here). For large W, or equivalently small
n, the initial SNR is small, and hence the information
$I \sim \sum_t \mathrm{SNR}(0) \exp(-t/\tau) \approx \mathrm{SNR}(0)\,\tau$ is independent of
W, as observed here, Fig. 2.

It is interesting to note that even unbounded synapses
store only a limited amount of information. In the framework
of this paper, the optimal local learning rule for
unbounded synapses yields SNR = n/m, where m is the
number of patterns. In the case that $m \gg n \gg 1$, each
pattern contributes $\mathrm{SNR}/(4\pi \ln 2)$ bits, which
corresponds to storing $1/(4\pi \ln 2) \approx 0.11$ bits/synapse.
Figure 2: Information capacity of multi-state synapses. Information
storage capacity per synapse versus the number W of
synaptic states, for dense (p = 0.5) and sparse (p = 0.05)
coding. Lines show analytic results (when available), whilst
points show numerical results. For a small number of synaptic
states, storage capacity increases, but for larger numbers it
saturates.



Finally we study, in the large n limit, the performance
of a simple "hard-bound" learning rule, i.e. a learning
rule with uniform equilibrium weight distribution. Under
this rule, whose SNR dynamics were previously studied
in Ref. [10], a positive (negative) input gives one step
potentiation (depression) with probability $f_+$ ($f_-$). For
$W > 4$ the optimal probabilities satisfy
$f_+ p = f_- q \approx e W \sqrt{spq\,(W+1)/(W-1)} \, / \, (2\sqrt{3n})$,
and the information storage capacity is

$$I_S \approx \frac{2W}{\pi^2 e} \sqrt{\frac{3(W-1)}{spqn\,(W+1)}} \approx 0.053 \times \frac{W}{\sqrt{pqn}}. \qquad (17)$$
Here the latter approximation is for large W. This
sub-optimal learning rule gives an information capacity of
the same functional form as the optimal learning rule,
but performs only 70% as well.

Given that simple stochastic learning performs almost
as well as the optimal learning rule, one may wonder
how well a simple deterministic learning rule performs
in comparison. For certain potentiation and depression,
$f_+ = f_- = 1$, one finds

$$I_S = \frac{W^2}{\pi^2 n} \ln\left( \frac{12 n}{W^2 s} \right). \qquad (18)$$



Although this grows faster with W, the 1/n behavior
means this performs much worse than optimal stochastic
learning rules. The memory decay time is in this case
$\tau = W^2/\pi^2$.
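The quoted decay time can be cross-checked against the spectrum of the mean update matrix. The sketch below assumes dense coding (p = 1/2), so that a deterministic hard-bound rule amounts to a reflecting random walk with step probability g = 1/2 in each direction. It verifies that a cosine profile is an eigenvector, and converts the eigenvalue into an SNR time constant, taking the SNR to decay twice as fast as the signal. Function names and W = 20 are choices made for this example.

```python
import math

def hard_bound_mean_matrix(W, g):
    """Mean update matrix: one step up or down, each with probability g, reflecting ends."""
    m = [[0.0] * W for _ in range(W)]
    for j in range(W):
        stay = 1.0
        if j < W - 1:
            m[j + 1][j] = g
            stay -= g
        if j > 0:
            m[j - 1][j] = g
            stay -= g
        m[j][j] = stay
    return m

def slowest_mode(W):
    """Eigenvector belonging to the largest eigenvalue below one."""
    return [math.cos(math.pi * (i + 0.5) / W) for i in range(W)]

def slowest_eigenvalue(W, g):
    """Eigenvalue of the cosine mode: 1 - 2 g (1 - cos(pi / W))."""
    return 1.0 - 2.0 * g * (1.0 - math.cos(math.pi / W))

def snr_time_constant(W, g):
    """Decay time of the SNR, which falls off as the square of the signal."""
    return -1.0 / (2.0 * math.log(slowest_eigenvalue(W, g)))

if __name__ == "__main__":
    W, g = 20, 0.5
    print(snr_time_constant(W, g), W ** 2 / math.pi ** 2)
```

For moderately large W the two printed values agree to within about one percent, consistent with $\tau = W^2/\pi^2$.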

In summary, learning using bounded, discrete weights
can be analyzed using the Shannon information. There
are two regimes: 1) when the number of synapses is small,
the information per synapse is constant and approximately
independent of the number of synaptic states,
2) when the number of synapses is large, the capacity
per synapse decreases as $1/\sqrt{n}$. Furthermore, we find
that in the second regime, the optimal transition matrices
are band diagonal. In particular, in the dense case
(p = 1/2), the matrix Eq. (14) is optimal. The critical n
that separates these regimes is dependent on the sparseness
and the number of weight states. Although there
are currently no good biological estimates for either the
number of weight states, or the sparsity, these results
might indicate that the number of synapses is limited to
prevent sub-optimal storage. When increasing the number
of synaptic states, we find that for small numbers
the information grows linearly, while for larger numbers
it levels off and reaches a saturation point.